NSF PAR Search | NSF Public Access Repository

MADGEN: Mass-Spec attends to De Novo Molecular generation

Wang, Yinkai; Chen, Xiaohui; Liu, Liping; Hassoun, Soha (April 2025, The Thirteenth International Conference on Learning Representations)

Free, publicly-accessible full text available April 28, 2026
Graph Generative Pre-trained Transformer

Chen, Xiaohui; Wang, Yinkai; He, Jiaxing; Du, Yuanqi; Hassoun, Soha; Xu, Xiaolin; Liu, Liping (January 2025, arxiv.org)

Full Text Available
On Separate Normalization in Self-supervised Transformers

Chen, Xiaohui; Wang, Yinkai; Du, Yuanqi; Hassoun, Soha; Liu, Li-Ping (December 2023, Advances in Neural Information Processing Systems 36)

Self-supervised training methods for transformers have demonstrated remarkable performance across various domains. Previous transformer-based models, such as masked autoencoders (MAE), typically utilize a single normalization layer for both the [CLS] symbol and the tokens. We propose in this paper a simple modification that employs separate normalization layers for the tokens and the [CLS] symbol to better capture their distinct characteristics and enhance downstream task performance. Our method aims to alleviate the potential negative effects of using the same normalization statistics for both token types, which may not be optimally aligned with their individual roles. We empirically show that by utilizing a separate normalization layer, the [CLS] embeddings can better encode the global contextual information and are distributed more uniformly in their anisotropic space. When replacing the conventional normalization layer with the two separate layers, we observe an average 2.7% performance improvement across the image, natural language, and graph domains.
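The idea in the abstract above, one normalization layer for the [CLS] symbol and a separate one for the remaining tokens, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which normalization type is used, so the sketch assumes plain layer normalization for both branches, assumes [CLS] sits at position 0, and uses the hypothetical names `LayerNorm` and `SeparateNorm`.

```python
import math
from typing import List, Sequence

Vector = List[float]


class LayerNorm:
    """Minimal layer normalization over one embedding vector.

    Each instance owns its scale (gamma) and shift (beta), so two
    instances never share parameters.
    """

    def __init__(self, dim: int, eps: float = 1e-5) -> None:
        self.gamma = [1.0] * dim
        self.beta = [0.0] * dim
        self.eps = eps

    def __call__(self, x: Sequence[float]) -> Vector:
        mean = sum(x) / len(x)
        var = sum((v - mean) ** 2 for v in x) / len(x)
        inv_std = 1.0 / math.sqrt(var + self.eps)
        return [g * (v - mean) * inv_std + b
                for v, g, b in zip(x, self.gamma, self.beta)]


class SeparateNorm:
    """Hypothetical wrapper: one norm for [CLS], another for all other tokens.

    Assumption: the [CLS] embedding is the first element of the sequence.
    """

    def __init__(self, dim: int) -> None:
        self.cls_norm = LayerNorm(dim)
        self.token_norm = LayerNorm(dim)

    def __call__(self, sequence: Sequence[Sequence[float]]) -> List[Vector]:
        cls, tokens = sequence[0], sequence[1:]
        return [self.cls_norm(cls)] + [self.token_norm(t) for t in tokens]


norm = SeparateNorm(dim=4)
# Pretend training gave the [CLS] branch a different scale than the token branch.
norm.cls_norm.gamma = [2.0] * 4
out = norm([[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]])
```

In this sketch the two branches differ only in their parameters, which is the smallest change that lets the [CLS] embedding and the token embeddings be scaled and shifted independently.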
Dataset Geography: Mapping Language Data to Language Users

https://doi.org/10.18653/v1/2022.acl-long.239

Faisal, Fahim; Wang, Yinkai; Anastasopoulos, Antonios (May 2022, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers))

Full Text Available
Small molecule generation via disentangled representation learning

https://doi.org/10.1093/bioinformatics/btac296

Du, Yuanqi; Guo, Xiaojie; Wang, Yinkai; Shehu, Amarda; Zhao, Liang; Xu, Jinbo (ed.) (May 2022, Bioinformatics)

Abstract. Motivation: Expanding our knowledge of small molecules beyond what is known in nature or designed in wet laboratories promises to significantly advance cheminformatics, drug discovery, biotechnology and material science. In silico molecular design remains challenging, primarily due to the complexity of the chemical space and the non-trivial relationship between chemical structures and biological properties. Deep generative models that learn directly from data are intriguing, but they have yet to demonstrate interpretability in the learned representation, which would let us learn more about the relationship between the chemical and biological space. In this article, we advance research on disentangled representation learning for small molecule generation. We build on recent work by us and others on deep graph generative frameworks, which capture atomic interactions via a graph-based representation of a small molecule. The methodological novelty is how we leverage the concept of disentanglement in the graph variational autoencoder framework both to generate biologically relevant small molecules and to enhance model interpretability. Results: Extensive qualitative and quantitative experimental evaluation in comparison with state-of-the-art models demonstrates the superiority of our disentanglement framework. We believe this work is an important step to address key challenges in small molecule generation with deep generative frameworks. Availability and implementation: Training and generated data are made available at https://ieee-dataport.org/documents/dataset-disentangled-representation-learning-interpretable-molecule-generation. All code is made available at https://anonymous.4open.science/r/D-MolVAE-2799/. Supplementary information: Supplementary data are available at Bioinformatics online.
Multi-objective Deep Data Generation with Correlated Property Control

Wang, Shiyu; Guo, Xiaojie; Lin, Xuanyang; Pan, Bo; Du, Yuanqi; Wang, Yinkai; Ye, Yanfang; Petersen, Ashley; Leitgeb, Austin; Alkhalifa, Saleh; et al (January 2022, Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS))

Full Text Available
Deep Latent-Variable Models for Controllable Molecule Generation

https://doi.org/10.1109/BIBM52615.2021.9669692

Du, Yuanqi; Wang, Yinkai; Alam, Fardina; Lu, Yuanjie; Guo, Xiaojie; Zhao, Liang; Shehu, Amarda (January 2021, 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM))

Representation learning via deep generative models is opening a new avenue for small molecule generation in silico. Linking chemical and biological space remains a key challenge. In this paper, we debut a graph-based variational autoencoder framework to address this challenge under the umbrella of disentangled representation learning. The framework permits several inductive biases that connect the learned latent factors to molecular properties. Evaluation on diverse benchmark datasets shows that the resulting models are powerful and open up an exciting line of research on controllable molecule generation in support of cheminformatics, drug discovery, and other application settings.
Full Text Available